01 Machine Learning Fundamentals
1 Challenge Summary
Your organization wants to know which companies are similar to each other to help identify potential customers of a SaaS software solution (e.g. Salesforce CRM or equivalent) in various segments of the market. The Sales Department is very interested in this analysis, which will help them more easily penetrate various market segments.
You will be using stock prices in this analysis. You come up with a method to classify companies based on how their stocks trade, using their daily stock returns (percentage movement from one day to the next). This analysis will help your organization determine which companies are related to each other (competitors, or companies with similar attributes).
You can analyze the stock prices using the unsupervised learning tools you’ve learned, including K-Means and UMAP. You will use a combination of kmeans() to find groups and umap() to visualize the similarity of daily stock returns.
2 Objectives
Apply your knowledge of K-Means and UMAP along with dplyr, ggplot2, and purrr to create a visualization that identifies subgroups in the S&P 500 Index. You will specifically apply:
- Modeling: kmeans() and umap()
- Iteration: purrr
- Data Manipulation: dplyr, tidyr, and tibble
- Visualization: ggplot2 (bonus: plotly)
3 Libraries
Loading libraries
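The library-loading code did not survive on this page. A plausible set, inferred only from the functions called in later steps (an assumption, not the original chunk), would be:

```r
# Assumed library set, inferred from the functions used later in this challenge
library(tidyverse)  # dplyr, tidyr, tibble, purrr, ggplot2, readr (read_rds)
library(broom)      # glance() and augment() for kmeans objects
library(umap)       # umap()
library(tidyquant)  # theme_tq() and palette_light()
```

Add library(plotly) as well if you attempt the bonus interactive plot.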
4 Data
We will be using stock prices in this analysis. Although some of you already know how to use an API to retrieve stock prices, I have already obtained the stock prices for every stock in the S&P 500 index for you. The files are saved in the session_6_data directory.
We can read in stock prices. The data is 1.2M observations. The important columns for analysis are:
- symbol: The stock ticker symbol that corresponds to a company’s stock price
- date: The timestamp relating the symbol to the share price at that point in time
- adjusted: The stock price, adjusted for any splits and dividends (we use this while analyzing stock data over long periods of time)
# STOCK PRICES
sp_500_prices_tbl <- read_rds("C:/Users/Jeevan/OneDrive/Documents/TUHH/Business decision with Machine Learning/Business Decisions with Machine Learning/sp_500_prices_tbl.rds")
sp_500_prices_tbl
The second data frame contains information about the stocks. The most important columns are:
- company: The company name
- sector: The sector that the company belongs to
5 Question
Which stock prices behave similarly?
Answering this question helps us understand which companies are related, and we can use clustering to help us answer it!
Even if you’re not interested in finance, this is still a great analysis because it will tell you which companies are competitors and which are likely in the same space (often called sectors) and can be categorized together. Bottom line - This analysis can help you better understand the dynamics of the market and competition, which is useful for all types of analyses from finance to sales to marketing.
5.1 Step 1 - Convert stock prices to standard format - daily returns
What you first need to do is get the data into a format that can be converted to a “user-item” style matrix. The challenge here is to connect the dots between what we have and what we need to do to format it properly.
We know that in order to compare the data, it needs to be standardized or normalized. Why? Because we cannot compare values (stock prices) that are of completely different magnitudes. In order to standardize, we convert from adjusted stock price (dollar value) to daily returns (percent change from the previous day). The formula is:
\[ return_{daily} = \frac{price_{i}-price_{i-1}}{price_{i-1}} \]
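As a small worked example of this formula, the sketch below applies it to three made-up prices (toy values, not taken from the data set). It assumes dplyr is available and uses lag() to fetch the previous day's price.

```r
library(dplyr)

# Toy prices for illustration only (not taken from the data set)
toy_prices <- tibble::tibble(
  date     = as.Date("2018-01-02") + 0:2,
  adjusted = c(100, 102, 99.96)
)

toy_returns <- toy_prices %>%
  mutate(
    adj_lag    = lag(adjusted),                   # price on the previous row
    pct_return = (adjusted - adj_lag) / adj_lag   # daily return formula
  )

toy_returns
```

The first row has no previous price, so its return is NA. The other two rows come out at roughly +2% and -2%.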
First, what do we have? We have daily stock prices for every stock in the S&P 500 Index, which is over 500 stocks. The data set is over 1.2M observations.
#> Rows: 1,225,765
#> Columns: 8
#> $ symbol <chr> "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", "MSFT…
#> $ date <date> 2009-01-02, 2009-01-05, 2009-01-06, 2009-01-07, 2009-01-08, …
#> $ open <dbl> 19.53, 20.20, 20.75, 20.19, 19.63, 20.17, 19.71, 19.52, 19.53…
#> $ high <dbl> 20.40, 20.67, 21.00, 20.29, 20.19, 20.30, 19.79, 19.99, 19.68…
#> $ low <dbl> 19.37, 20.06, 20.61, 19.48, 19.55, 19.41, 19.30, 19.52, 19.01…
#> $ close <dbl> 20.33, 20.52, 20.76, 19.51, 20.12, 19.52, 19.47, 19.82, 19.09…
#> $ volume <dbl> 50084000, 61475200, 58083400, 72709900, 70255400, 49815300, 5…
#> $ adjusted <dbl> 15.86624, 16.01451, 16.20183, 15.22628, 15.70234, 15.23408, 1…
Your first task is to convert to a tibble named sp_500_daily_returns_tbl by performing the following operations:
- Select the symbol, date and adjusted columns
- Filter to dates beginning in the year 2018 and beyond.
- Compute a lag of 1 day on the adjusted stock price. Be sure to group by symbol first; otherwise the lags will be computed using values from the previous stock in the data frame.
- Remove any NA values from the lagging operation
- Compute the difference between adjusted and the lag
- Compute the percentage difference by dividing the difference by that lag. Name this column pct_return.
- Return only the symbol, date, and pct_return columns
- Save as a variable named sp_500_daily_returns_tbl
# Apply data transformation skills.
sp_500_daily_returns_tbl <- sp_500_prices_tbl %>%
  select(symbol, date, adjusted) %>%
  filter(date >= as.Date("2018-01-01")) %>%
  # Group by symbol so each stock's lag uses only its own prices
  group_by(symbol) %>%
  mutate(adj_lag = lag(adjusted)) %>%
  filter(!is.na(adj_lag)) %>%
  mutate(diff = adjusted - adj_lag,
         pct_return = diff / adj_lag) %>%
  # Drop the grouping so later steps work on a plain tibble
  ungroup() %>%
  select(symbol, date, pct_return)
sp_500_daily_returns_tbl
5.2 Step 2 - Convert it to User-Item Format
The next step is to convert to user-item format, with the symbol in the first column and one column per date holding the daily return (pct_return) of each stock on that date.
We’re going to import correct results first (just in case you were not able to complete the last step).
sp_500_daily_returns_tbl <- read_rds("C:/Users/Jeevan/OneDrive/Documents/TUHH/Business decision with Machine Learning/Business Decisions with Machine Learning/sp_500_daily_returns_tbl.rds")
sp_500_daily_returns_tbl
Now that we have the daily returns (percentage change from one day to the next), we can convert to user-item format. The user in this case is the symbol (company), and the item in this case is the pct_return at each date.
- Spread the date column to get the values as percentage returns. Make sure to fill any NA values with zeros.
- Save the result as stock_date_matrix_tbl
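The spreading step can be sketched on a tiny invented table as follows. This is one possible approach, assuming tidyr's pivot_wider() is used for the spread (the older spread() with fill = 0 would serve the same purpose). The toy_ names and their values are hypothetical.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for sp_500_daily_returns_tbl (values invented for illustration)
toy_daily_returns_tbl <- tibble::tibble(
  symbol     = c("AAA", "AAA", "BBB"),
  date       = as.Date(c("2018-01-03", "2018-01-04", "2018-01-04")),
  pct_return = c(0.010, -0.020, 0.005)
)

# One row per symbol, one column per date, missing returns filled with 0
toy_stock_date_matrix_tbl <- toy_daily_returns_tbl %>%
  ungroup() %>%
  pivot_wider(names_from = date, values_from = pct_return, values_fill = 0)

toy_stock_date_matrix_tbl
```

On the real data, start from sp_500_daily_returns_tbl instead of the toy table and save the result as stock_date_matrix_tbl.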
5.3 Step 3 - Performing K-Means Clustering
Next, we’ll perform K-Means clustering.
We’re going to import the correct results first (just in case you were not able to complete the last step).
Beginning with stock_date_matrix_tbl, perform the following operations:
- Drop the non-numeric column, symbol
- Perform kmeans() with centers = 4 and nstart = 20
- Save the result as kmeans_obj
# Creating kmeans_obj for 4 centers
kmeans_obj <- stock_date_matrix_tbl %>%
  select(-symbol) %>%
  kmeans(centers = 4, nstart = 20)
kmeans_obj %>% glance()
Use glance() to get the tot.withinss.
5.4 Step 4 - Find optimal value for K
Now that we are familiar with the process for calculating kmeans(), let’s use purrr to iterate over many values of “k” using the centers argument.
We’ll use this custom function called kmeans_mapper():
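The function definition is missing from this page. A minimal sketch of what kmeans_mapper() could look like, assuming it simply wraps the Step 3 call with a variable number of centers, is shown here. The argument name center, its default, and the toy matrix are assumptions.

```r
library(dplyr)

set.seed(123)
# Toy stand-in for stock_date_matrix_tbl: 12 "stocks" x 5 "dates" of random returns.
# Skip this part when the real stock_date_matrix_tbl is already loaded.
toy_matrix <- matrix(rnorm(60, sd = 0.01), nrow = 12,
                     dimnames = list(NULL, paste0("d", 1:5)))
stock_date_matrix_tbl <- bind_cols(
  tibble::tibble(symbol = paste0("S", 1:12)),
  tibble::as_tibble(toy_matrix)
)

# Assumed definition: run kmeans() on the numeric columns for a given number of centers
kmeans_mapper <- function(center = 3) {
  stock_date_matrix_tbl %>%
    select(-symbol) %>%
    kmeans(centers = center, nstart = 20)
}

fit <- kmeans_mapper(2)
fit$size
```

Because the function reads stock_date_matrix_tbl from the enclosing environment, map() only needs to pass the number of centers.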
Apply kmeans_mapper() and glance() functions iteratively using purrr.
- Create a tibble containing a column called centers that goes from 1 to 30
- Add a column named k_means with the kmeans_mapper() output. Use mutate() to add the column and map() to map centers to the kmeans_mapper() function.
- Add a column named glance with the glance() output. Use mutate() and map() again to iterate over the column of k_means.
- Save the output as k_means_mapped_tbl
# Using purrr to map
k_means_mapped_tbl <- tibble(centers = 1:30) %>%
  mutate(k_means = centers %>% map(kmeans_mapper),
         glance = k_means %>% map(glance))
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `k_means = centers %>% map(kmeans_mapper)`.
#> Caused by warning:
#> ! did not converge in 10 iterations
Next, let’s visualize the “tot.withinss” from the glance output as a Scree Plot.
- Begin with the k_means_mapped_tbl
- Unnest the glance column
- Plot the centers column (x-axis) versus the tot.withinss column (y-axis) using geom_point() and geom_line()
- Add a title “Scree Plot” and feel free to style it with your favorite theme
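The plotting steps above can be sketched as follows. To stay self-contained, the example builds a small toy version of k_means_mapped_tbl from random data with 1 to 8 centers; the toy matrix and the variable name scree_plot are assumptions for illustration.

```r
library(dplyr)
library(purrr)
library(tidyr)
library(ggplot2)
library(broom)

set.seed(123)
toy_matrix <- matrix(rnorm(200), nrow = 40)   # toy data: 40 rows x 5 columns

# Toy stand-in for k_means_mapped_tbl, built over 1 to 8 centers
k_means_mapped_tbl <- tibble::tibble(centers = 1:8) %>%
  mutate(
    k_means = map(centers, ~ kmeans(toy_matrix, centers = .x, nstart = 20)),
    glance  = map(k_means, glance)
  )

scree_plot <- k_means_mapped_tbl %>%
  unnest(glance) %>%                       # exposes tot.withinss as a column
  ggplot(aes(x = centers, y = tot.withinss)) +
  geom_point() +
  geom_line() +
  labs(title = "Scree Plot")

scree_plot
```

With the real k_means_mapped_tbl, only the pipeline starting at unnest(glance) is needed.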
We can see that the Scree Plot becomes linear (constant rate of change) between 5 and 10 centers for K.
5.5 Step 5 - Applying UMAP
Next, let’s plot the UMAP 2D visualization to help us investigate cluster assignments.
We’re going to import the correct results first (just in case you were not able to complete the last step).
First, let’s apply the umap() function to the stock_date_matrix_tbl, which contains our user-item matrix in tibble format.
- Start with stock_date_matrix_tbl
- De-select the symbol column
- Use the umap() function, storing the output as umap_results
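A minimal sketch of these steps, run on a toy user-item tibble of random values so that it is self-contained (the toy data is an assumption; the default umap() configuration needs more rows than its 15 nearest neighbors, hence 30 toy rows):

```r
library(dplyr)
library(umap)

set.seed(123)
# Toy stand-in for stock_date_matrix_tbl: 30 "stocks" x 10 "dates"
toy_matrix <- matrix(rnorm(300, sd = 0.01), nrow = 30,
                     dimnames = list(NULL, paste0("d", 1:10)))
stock_date_matrix_tbl <- bind_cols(
  tibble::tibble(symbol = paste0("S", 1:30)),
  tibble::as_tibble(toy_matrix)
)

umap_results <- stock_date_matrix_tbl %>%
  select(-symbol) %>%   # umap() needs numeric input only
  umap()

dim(umap_results$layout)
```

The layout element is a matrix with one row per stock and two columns, which is what the next step converts to a tibble.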
Next, we want to combine the layout from the umap_results with the symbol column from the stock_date_matrix_tbl.
- Start with umap_results$layout
- Convert from a matrix data type to a tibble with as_tibble()
- Bind the columns of the umap tibble with the symbol column from the stock_date_matrix_tbl.
- Save the results as umap_results_tbl.
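These steps can be sketched as below. The layout matrix and symbols are invented stand-ins, so the example runs without calling umap(). Because the layout matrix has no column names, the sketch names the columns V1 and V2 explicitly through .name_repair rather than relying on as_tibble() defaults; that choice is an assumption made to match the V1 and V2 columns used later.

```r
library(dplyr)

set.seed(123)
# Toy stand-ins (invented): a 10 x 2 layout matrix and a matching symbol column
umap_results <- list(layout = matrix(rnorm(20), ncol = 2))
stock_date_matrix_tbl <- tibble::tibble(symbol = paste0("S", 1:10))

umap_results_tbl <- umap_results$layout %>%
  # Name the two unnamed matrix columns V1 and V2
  tibble::as_tibble(.name_repair = function(nms) paste0("V", seq_along(nms))) %>%
  bind_cols(stock_date_matrix_tbl %>% select(symbol))

umap_results_tbl
```

bind_cols() matches rows by position, so this relies on the layout rows being in the same order as stock_date_matrix_tbl.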
Finally, let’s make a quick visualization of the umap_results_tbl.
- Pipe the umap_results_tbl into ggplot(), mapping the columns to the x-axis and y-axis
- Add a geom_point() geometry with an alpha = 0.5
- Apply theme_tq() and add a title “UMAP Projection”
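A sketch of this plot on invented coordinates (the toy umap_results_tbl and the name umap_plot are assumptions; theme_tq() comes from tidyquant):

```r
library(dplyr)
library(ggplot2)
library(tidyquant)

set.seed(123)
# Toy stand-in for umap_results_tbl
umap_results_tbl <- tibble::tibble(
  V1     = rnorm(50),
  V2     = rnorm(50),
  symbol = paste0("S", 1:50)
)

umap_plot <- umap_results_tbl %>%
  ggplot(aes(x = V1, y = V2)) +
  geom_point(alpha = 0.5) +
  theme_tq() +
  labs(title = "UMAP Projection")

umap_plot
```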
We can now see that we have some clusters. However, we still need to combine the K-Means clusters and the UMAP 2D representation.
5.6 Step 6 - Combining K-Means and UMAP
Next, we combine the K-Means clusters and the UMAP 2D representation.
We’re going to import the correct results first (just in case you were not able to complete the last step).
k_means_mapped_tbl <- read_rds("C:/Users/Jeevan/OneDrive/Documents/TUHH/Business decision with Machine Learning/Business Decisions with Machine Learning/k_means_mapped_tbl.rds")
umap_results_tbl <- read_rds("C:/Users/Jeevan/OneDrive/Documents/TUHH/Business decision with Machine Learning/Business Decisions with Machine Learning/umap_results_tbl.rds")
First, pull out the K-Means for 10 Centers. Use this since beyond this value the Scree Plot flattens. Have a look at the business case to recall how that works.
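One way to pull out that model, sketched on a toy k_means_mapped_tbl built from random data (the toy table is an assumption; the filter, pull, and pluck chain is one possible approach, not necessarily the original one):

```r
library(dplyr)
library(purrr)

set.seed(123)
toy_matrix <- matrix(rnorm(200), nrow = 40)   # toy data: 40 rows x 5 columns

# Toy stand-in for k_means_mapped_tbl with just two values of centers
k_means_mapped_tbl <- tibble::tibble(centers = c(5, 10)) %>%
  mutate(k_means = map(centers, ~ kmeans(toy_matrix, centers = .x, nstart = 20)))

k_means_obj <- k_means_mapped_tbl %>%
  filter(centers == 10) %>%   # keep the row for 10 centers
  pull(k_means) %>%           # extract the list column
  pluck(1)                    # take the single kmeans object out of the list

k_means_obj$size
```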
Next, we’ll combine the clusters from the k_means_obj with the umap_results_tbl.
- Begin with the k_means_obj
- Augment the k_means_obj with the stock_date_matrix_tbl to get the clusters added to the end of the tibble
- Select just the symbol and .cluster columns
- Left join the result with the umap_results_tbl by the symbol column
- Left join the result with the result of sp_500_index_tbl %>% select(symbol, company, sector) by the symbol column.
- Store the output as umap_kmeans_results_tbl
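The steps above can be sketched as follows. All four input objects are small invented stand-ins (random returns, random coordinates, made-up company names and sectors) so that the example is self-contained. augment() comes from broom.

```r
library(dplyr)
library(broom)

set.seed(123)
symbols <- paste0("S", 1:20)

# Toy stand-ins for the real objects (all values invented)
toy_matrix <- matrix(rnorm(100, sd = 0.01), nrow = 20,
                     dimnames = list(NULL, paste0("d", 1:5)))
stock_date_matrix_tbl <- bind_cols(tibble::tibble(symbol = symbols),
                                   tibble::as_tibble(toy_matrix))
umap_results_tbl <- tibble::tibble(V1 = rnorm(20), V2 = rnorm(20), symbol = symbols)
sp_500_index_tbl <- tibble::tibble(symbol  = symbols,
                                   company = paste("Company", 1:20),
                                   sector  = rep(c("Tech", "Energy"), 10),
                                   weight  = 1 / 20)
k_means_obj <- kmeans(select(stock_date_matrix_tbl, -symbol), centers = 3, nstart = 20)

umap_kmeans_results_tbl <- k_means_obj %>%
  augment(stock_date_matrix_tbl) %>%    # appends the .cluster column
  select(symbol, .cluster) %>%
  left_join(umap_results_tbl, by = "symbol") %>%
  left_join(sp_500_index_tbl %>% select(symbol, company, sector), by = "symbol")

umap_kmeans_results_tbl
```

The result has one row per symbol with its cluster, its two UMAP coordinates, and the company and sector labels.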
Plot the K-Means and UMAP results.
- Begin with the umap_kmeans_results_tbl
- Use ggplot() mapping V1, V2 and color = .cluster
- Add the geom_point() geometry with alpha = 0.5
- Apply colors as you desire (e.g. scale_color_manual(values = palette_light() %>% rep(3)))
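A sketch of the final plot on invented data (the toy umap_kmeans_results_tbl and the name cluster_plot are assumptions). The sketch wraps the palette in unname() as a precaution: ggplot2 matches a named values vector to the factor levels by name, so color names would not line up with cluster labels such as "1" or "2".

```r
library(dplyr)
library(ggplot2)
library(tidyquant)

set.seed(123)
# Toy stand-in for umap_kmeans_results_tbl
umap_kmeans_results_tbl <- tibble::tibble(
  symbol   = paste0("S", 1:60),
  .cluster = factor(sample(1:3, 60, replace = TRUE)),
  V1       = rnorm(60),
  V2       = rnorm(60)
)

cluster_plot <- umap_kmeans_results_tbl %>%
  ggplot(aes(x = V1, y = V2, color = .cluster)) +
  geom_point(alpha = 0.5) +
  # unname() so colors are assigned by position rather than matched by name
  scale_color_manual(values = unname(palette_light()) %>% rep(3)) +
  theme_tq() +
  labs(title = "K-Means Clusters on the UMAP Projection")

cluster_plot
```

For the plotly bonus, passing the finished ggplot object to plotly::ggplotly() is one option.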
Congratulations! You are done with the 1st challenge!
